Data Science in Environmental Science and Oceanography with R and Python

Course Notes

Interactive course notes for a workshop at Informatica Feminale 2025
Authors

Dana Ransby

Kathrin Riemann-Campe

Maren Rebke

Diren Senger

(Alfred-Wegener-Institut Helmholtz Zentrum für Polar- und Meeresforschung)

Published

August 20, 2025

1 About

This guide was created for the course “Data Science in Environmental Science and Oceanography with R and Python” at the Informatica Feminale 2025.

2 Agenda

2.1 Agenda - Day 1: PANGAEA and R & Python basics

Aug 20th Topic Presenter
10:00 Round of introductions (20 min) all including participants
10:00 Intro PANGAEA, data repos (20 min) DR
10:20 How to find and use data from PANGAEA (15 mins) KRC
10:35 Quiz (10 min) KRC
10:45 How to download datasets (10 min) KRC
10:55 Questions I (10 min) KRC & DR
11:05 break
11:15 Intro to PANGAEA packages and tokens (20-25 min) DR
11:35 Questions II (10min) DR & KRC
11:45 Lunch break
14:00 Basics of R and Python DS
16:30 end of day 1

2.2 Agenda - Day 2: Getting data from PANGAEA and Data Cleaning

Aug 21st Topic Presenter
09:00 Downloading and manipulating PANGAEA data in R - in brief DR
09:45 Minibreak
09:50 Downloading and manipulating PANGAEA data in Python (100 min including bio break after 45 min) KRC
12:30 Lunch break
14:00 Data Cleaning in Python KRC
15:40 Data Cleaning in R DS
16:30 end of day 2

2.3 Agenda - Day 3: Plotting and Statistics

Aug 22nd Topic Presenter
09:00 Intro and Cleaning of Penguin data set, Plotting & Statistics DS & MR
12:30 Lunch break
14:00 Markdown & time for open questions all
16:30 end of day 3

3 Before the course

3.1 Installation Notes

If you haven’t installed R and RStudio yet, use the following instructions for the installation:

  1. Install R:

Installation on Windows and Mac https://www.dataquest.io/blog/installing-r-on-your-computer/ For Linux https://cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-R-be-installed-_0028Unix_002dlike_0029

  2. Install the free RStudio Desktop version: https://www.rstudio.com/products/rstudio/#Desktop

  3. Do you know how you can install new R packages? You can read about it here: https://thedatacommunity.org/2017/09/17/installing-loading-unloading-and-removing-r-packages-in-rstudio/

In this script, we make use of the following packages:

  • pangaear
  • dplyr
  • tidyr
  • stringr
  • lubridate
  • httr
  • worrms
  • knitr
  • magrittr
  • car
  • palmerpenguins
  • reshape
  • ggplot2
  • (reticulate)

Many integrated development environments (IDEs) exist for Python. For this beginner’s course, we recommend the free and open-source Python IDE Thonny.

  1. Install Python:
    Thonny is available for Windows, Mac and Linux. Just download the suitable installer for your computer and run the installation. That completes the basic setup of Python. When you open Thonny for the first time, you will see the default layout with a code editor on the upper left side, a shell on the lower left side and an assistant on the right-hand side: Thonny default set up

  2. Install additional Python packages:
    During our course, we will use several additional packages. All of them can be installed via the package installer pip.
    Example: Write the following code in the shell to install pangaeapy.

!pip install pangaeapy

Alternatively, click on Tools in the upper left-hand corner of Thonny and select Manage packages…. Search for jupyter on PyPI and click on the jupyter search result to start the installation.

The following packages are needed for this course (re, os, datetime, difflib and collections are part of Python’s standard library and do not need to be installed):
- jupyter
- pangaeapy
- numpy
- pandas
- re
- os
- datetime
- ipdb
- requests
- LatLon23
- difflib
- collections
- matplotlib
- plotly
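To check that the installation worked, you can import a few of the packages in the Thonny shell and print their versions (a minimal sketch; the version numbers on your machine will differ):

```python
# Standard library modules (re, os, datetime, ...) need no installation;
# third-party packages such as numpy and pandas must be installed via pip.
import re, os, datetime, difflib, collections
import numpy
import pandas

print("numpy", numpy.__version__)
print("pandas", pandas.__version__)
print("all imports worked")
```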

3.2 Programming Environments

For R, we recommend RStudio.

3.2.0.1 Creating a new project

Many statistics classes teach you to set a working directory:

setwd("~/Documents/")

It can be helpful to create a project instead. All files within the project’s directory will automatically use the correct working directory. This way, you can create new files and start coding right away.

  1. Click File -> New Project… Even when you choose “Existing Directory” you will still be able to create a new directory Creating a Project, Step 1
  2. Enter the directory path straight away, or click on browse. Creating a Project, Step 2
  3. Choose an existing directory or create a new one Creating a Project, Step 3

3.2.0.2 RStudio screen

  • In the upper left hand corner you have the code editor. This is where you can type your scripts.
  • In the bottom left hand corner you have the console. Here you will see the results of all commands you are running.
  • In the upper right hand corner you can see all objects you created, e.g. data frames, functions, strings, …
  • In the bottom right hand corner you will be able to see the files in your project, your plots and more information, e.g. about functions RStudio screen

You can use “Ctrl” followed by a number to jump to a specific window. If you want to get into the terminal, use “Shift+Alt+t” or “Shift+Alt+m”. You can find an overview of all shortcuts using “Shift+Alt+k”. RStudio screen

3.2.0.3 R Workflow

  • You type your scripts in the code editor, so that you can save them.
  • If you want to run the whole script, you can click on ‘Source’ at the top of the code editor or press “Ctrl + Alt + R”. (Shortcuts might differ depending on your system.)
  • Use “Ctrl + Alt + B/E” to run from the beginning to the current line or from the current line to the end.
  • If you want to run a single line, you can click on ‘Run’ or press “Control + Enter”
  • If you want to run a few lines, highlight the lines you want to run and use ‘Run’/ “Control + Enter”
  • If you want to use Autocomplete, you can press “Control + Space”
  • If you want to read about an existing method, you can type “?<Name of the method>” to get to the documentation

3.2.0.4 Save your code in a script or notebook

Python code is saved either in scripts with a .py extension or in so-called Jupyter Notebooks with an .ipynb extension. There is no difference in the Python code itself, only in the way the code is executed.

  • Files with a .py extension need to be executed in a shell or terminal, i.e. somewhere you can type the command to execute your script. If you want to know more about the differences between a shell and a terminal, have a look at GeeksforGeeks.

  • Files with an .ipynb extension consist of individual cells which can be executed directly in the Jupyter Notebook.

  • Note: If you open a Jupyter Notebook in an editor, you will see that the Python code is mixed with additional code to interpret each notebook cell. You can convert .py files to .ipynb files and vice versa using a command line tool in the shell as described in the nbconvert documentation.

In this course we will work with Jupyter Notebook. To open a Jupyter environment write in the shell window of Thonny:
!jupyter notebook

After pressing enter, a browser window will open to show you your Home directory. Jupyter home Double click on your Jupyter Notebook of choice to open it or open a new one with the File button, select new and Notebook. Select the suggested Kernel and start scripting. New Notebook
Type anything you want, e.g. print('Hello World'), into the cell and execute it by clicking on the button with the single triangle.

3.3 Data types

3.3.0.1 Character

firstname <- "Pipi"
surname <- "Langstrumpf"
name <- paste(firstname, surname)
name
[1] "Pipi Langstrumpf"

3.3.0.2 Numeric

a <- 1
b <- 2
c <- a + b
c
[1] 3
d <- b*c/a
d
[1] 6
e <- d^2
e
[1] 36
f <- as.integer(3.14)
f
[1] 3

3.3.0.3 Logical

a <- TRUE
b <- FALSE

# or
a|b
[1] TRUE
# and
a&b
[1] FALSE
# not
!a
[1] FALSE
4 == 5
[1] FALSE
4 < 5
[1] TRUE
4>=4
[1] TRUE
is.na
function (x)  .Primitive("is.na")
as.numeric(a)
[1] 1

3.3.0.4 && and ||

c(T,T,T) & c(T,F,F)
[1]  TRUE FALSE FALSE
c(T,T,T) && c(T,F,F)
Error in c(T, T, T) && c(T, F, F): 'length = 3' in coercion to 'logical(1)'
IDontExist
Error: object 'IDontExist' not found
T | IDontExist
Error: object 'IDontExist' not found
T || IDontExist
[1] TRUE

3.3.0.5 Vectors

# a simple numeric vector
myvector <- c(1,4,6)
myvector
[1] 1 4 6
# a vector of type character
myfamily <- c("Paula", "Ali", "Emma")
myfamily
[1] "Paula" "Ali"   "Emma" 
# get the first element of a vector
myfamily[1]
[1] "Paula"
# apply function to all items:
myvector + 1
[1] 2 5 7
paste(myfamily, "Meier")
[1] "Paula Meier" "Ali Meier"   "Emma Meier" 
# concatenate vectors:
longervector <- c(myvector, 5:7)
longervector
[1] 1 4 6 5 6 7
# create a sequence
odd <- seq(from = 1, to = 10, by = 2)
odd
[1] 1 3 5 7 9
# create a boring sequence
boring <- rep(1, 10)
boring
 [1] 1 1 1 1 1 1 1 1 1 1

3.3.0.6 Factors

fac <- factor(c("good", "better", "best"))
levels(fac)
[1] "best"   "better" "good"  
nlevels(fac)
[1] 3

3.3.0.7 List of lists

You can create a list of lists (a list of vectors):

myList <- list(smallLetters=letters[1:10], 
               capitalLetters=LETTERS[1:10], 
               numbers=1:5)
myList
$smallLetters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

$capitalLetters
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"

$numbers
[1] 1 2 3 4 5

If you want to select sublists, you can do it with single brackets.

# Accessing multiple elements
myList[1:2]
$smallLetters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

$capitalLetters
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"

If you want to choose single elements, you need double brackets:

# Accessing single elements
myList[2][2]
$<NA>
NULL
myList[[2]][2]
[1] "B"
myList[[1:2]]
[1] "b"

3.3.0.8 Matrices

m <- matrix(data=1:12, nrow = 3, ncol = 4)
m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

Element wise operations:

m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Element-wise operations
m * 0.5
     [,1] [,2] [,3] [,4]
[1,]  0.5  2.0  3.5  5.0
[2,]  1.0  2.5  4.0  5.5
[3,]  1.5  3.0  4.5  6.0
m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Element-wise operations
m * c(1,2,3)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    4   10   16   22
[3,]    9   18   27   36

Matrix/vector multiplication:

m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Matrix multiplication
m %*% c(1,2,3,4)
     [,1]
[1,]   70
[2,]   80
[3,]   90

Good explanation of Matrix multiplication: Eli Bendersky’s website

3.3.0.9 Data frames

# Creating a data frame
family.frame <- data.frame(index = 1:3, firstname = myfamily, surname = rep("Meier", 3))
library(knitr)
kable(family.frame)
index firstname surname
1 Paula Meier
2 Ali Meier
3 Emma Meier

A comprehensive list of data types is given in the Python documentation.
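In Python, you can always check the type of a value with the built-in function type():

```python
# type() returns the data type of any Python value
print(type('Pipi'))      # <class 'str'>
print(type(1))           # <class 'int'>
print(type(3.14))        # <class 'float'>
print(type(True))        # <class 'bool'>
print(type([1, 4, 6]))   # <class 'list'>
```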

3.3.0.10 Character aka String

firstname = 'Pipi'
surname = 'Langstrumpf'

name = firstname+surname
print(name)
PipiLangstrumpf
name = firstname+' '+surname
print(name)
Pipi Langstrumpf

3.3.0.11 Numeric

import numpy as np

a = 1
b = 2
c = a + b
print(c)
3
d = b*c/a
print(d)
6.0
e = d ** 2
print(e)
36.0
f = int(3.14)
print(f)
3
g = float(3.14)
print(g)
3.14
h = np.nan
float(h)
nan

3.3.0.12 Logical

a = True
b = False

# or
a or b
True
# and
a and b
False
# not
not a
False
4 == 5
False
4 < 5
True
4 >= 4
True
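Python’s and and or behave like R’s && and || shown in the R part: they short-circuit, i.e. the right-hand side is only evaluated when it is needed to decide the result. A small sketch:

```python
def right_side():
    # side effect lets us see whether this function was actually called
    print("right side evaluated")
    return True

print(True or right_side())    # True; right_side() is never called
print(False and right_side())  # False; right_side() is never called
print(False or right_side())   # prints "right side evaluated", then True
```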

3.3.0.13 Vectors aka list

# a simple numeric vector is similar to a list in python
myvector = [1,4,6]
print(myvector)
[1, 4, 6]
# a vector of type character
myfamily = ["Paula", "Ali", "Emma"]
print(myfamily)
['Paula', 'Ali', 'Emma']
# get the first element of a vector
myfamily[0]
'Paula'
# append another item:
myfamily.append("Meier")
print(myfamily)
['Paula', 'Ali', 'Emma', 'Meier']
# create a sequence
odd = list(range(1, 10, 2)) # start, stop, step
print(odd)
[1, 3, 5, 7, 9]
# create a boring sequence
boring = list(range(10))
print(boring)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# concatenate vectors:
longervector = myvector + odd
print(longervector)
[1, 4, 6, 1, 3, 5, 7, 9]

3.3.0.14 Matrices aka Array

import numpy as np
m = np.arange(1, 13).reshape(3, 4)  
print(m)
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]

Element wise operations:

# Element-wise operations
result = m * 0.5
print(result)
[[0.5 1.  1.5 2. ]
 [2.5 3.  3.5 4. ]
 [4.5 5.  5.5 6. ]]
m = np.arange(1, 13).reshape(3, 4)
n = np.array([1, 2, 3, 4])

# Element-wise operations
result = m * n
print(result)
[[ 1  4  9 16]
 [ 5 12 21 32]
 [ 9 20 33 48]]

Matrix/vector multiplication:

m = np.arange(1, 13).reshape(3, 4)
v = np.array([1, 2, 3, 4])

# Matrix multiplication
result = np.dot(m, v)
print(result)
[ 30  70 110]
# or:
result = np.matmul(m, v)
print(result)
[ 30  70 110]

Good explanation of Matrix multiplication: Eli Bendersky’s website

3.3.0.15 Dataframes

Dataframes in Python are defined in the package Pandas.

import pandas as pd
# Creating a data frame from a list

myfamily = ['Paula', 'Ali', 'Emma']
lastname = ['Meier']*3
family_frame = pd.DataFrame(list(zip(myfamily,lastname)),columns=['firstname','surname'])
family_frame
  firstname surname
0     Paula   Meier
1       Ali   Meier
2      Emma   Meier

4 PANGAEA - Introduction

The slides of the PANGAEA theory part are stored in github.

5 Basics R and Python

5.1 Preparing your data for analysis

5.1.0.1 Reading the data

Usually, you would use the “csv” format to read data into R. Sometimes you have other formats, e.g. “txt”. Optimally, you have exactly one line for the header. Make sure to choose the correct separator. You also have to tell R whether your file has a header or not.

df <- read.table("beedata.csv")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 2 did not have 2 elements

That did not work. Oh, yes, the separator!

df <- read.table("beedata.csv", sep = ",")
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
time hive h t_i_1 t_i_2 t_i_3 t_i_4 t_i_5 t_o weight_kg
1.565081418e+18 4 NA 27.25 26.375 23.4375 23.5 23.25 21.625 NA
1.565081428e+18 4 NA 27.25 26.4375 23.5 23.5 23.25 21.625 NA
1.565081431e+18 13 59.0554 22.875 22.1875 23 21.8125 22.6875 22.875 10.779
1.565081438e+18 4 NA 27.25 26.375 23.5 23.5625 23.25 21.625 NA

Looks ok, but we forgot the heading:

df <- read.table("beedata.csv", sep = ",", header = T)
time hive h t_i_1 t_i_2 t_i_3 t_i_4 t_i_5 t_o weight_kg
1.565081e+18 4 NA 27.250 26.3750 23.4375 23.5000 23.2500 21.625 NA
1.565081e+18 4 NA 27.250 26.4375 23.5000 23.5000 23.2500 21.625 NA
1.565081e+18 13 59.0554 22.875 22.1875 23.0000 21.8125 22.6875 22.875 10.779
1.565081e+18 4 NA 27.250 26.3750 23.5000 23.5625 23.2500 21.625 NA
1.565081e+18 4 NA 27.250 26.3750 23.5000 23.5625 23.2500 21.625 NA

That looks good!

Alternatively, we can use “read.csv”. In this case, the default parameters work for us.

df <- read.csv("beedata.csv")

Take care, sometimes you don’t get an error, but your data is screwed up anyway: sample Data

df <- read.csv("sampleData.csv", sep = " ")
name. age. subject. year
Li; 18; chemistry; 1
Svenson,Jacob; 25; psychology; 12
Raphaela; 23; psychology; 1

Use the head function to inspect your data.

5.1.0.2 Transforming entire columns

The columns/ variables of a dataframe basically behave like entries in a named list. Each element of this list is a vector of a specific type. You can inspect the structure of a dataframe using the function “str”.

df <- read.csv("beedata.csv")
str(df)
'data.frame':   171299 obs. of  10 variables:
 $ time     : num  1.57e+18 1.57e+18 1.57e+18 1.57e+18 1.57e+18 ...
 $ hive     : int  4 4 13 4 4 4 4 4 4 4 ...
 $ h        : num  NA NA 59.1 NA NA ...
 $ t_i_1    : num  27.2 27.2 22.9 27.2 27.2 ...
 $ t_i_2    : num  26.4 26.4 22.2 26.4 26.4 ...
 $ t_i_3    : num  23.4 23.5 23 23.5 23.5 ...
 $ t_i_4    : num  23.5 23.5 21.8 23.6 23.6 ...
 $ t_i_5    : num  23.2 23.2 22.7 23.2 23.2 ...
 $ t_o      : num  21.6 21.6 22.9 21.6 21.6 ...
 $ weight_kg: num  NA NA 10.8 NA NA ...

E.g. converting kg to g:

df$weight_g <- df$weight_kg*1000

Marking high temperature values:

df$highTemp <- df$t_i_1>25
time hive h t_i_1 weight_kg weight_g highTemp
1.565081e+18 4 NA 27.250 NA NA TRUE
1.565081e+18 4 NA 27.250 NA NA TRUE
1.565081e+18 13 59.0554 22.875 10.779 10779 FALSE
1.565081e+18 4 NA 27.250 NA NA TRUE
1.565081e+18 4 NA 27.250 NA NA TRUE

Dealing with the timestamp (nanoseconds to seconds)

df$time <- as.POSIXct(df$time/1000000000, origin="1970-01-01")
time hive h t_i_1 t_i_2 t_i_3 t_i_4 t_i_5 t_o weight_kg weight_g highTemp
2019-08-06 10:50:18 4 NA 27.250 26.3750 23.4375 23.5000 23.2500 21.625 NA NA TRUE
2019-08-06 10:50:28 4 NA 27.250 26.4375 23.5000 23.5000 23.2500 21.625 NA NA TRUE
2019-08-06 10:50:31 13 59.0554 22.875 22.1875 23.0000 21.8125 22.6875 22.875 10.779 10779 FALSE
2019-08-06 10:50:38 4 NA 27.250 26.3750 23.5000 23.5625 23.2500 21.625 NA NA TRUE
2019-08-06 10:50:48 4 NA 27.250 26.3750 23.5000 23.5625 23.2500 21.625 NA NA TRUE

5.1.0.3 Reading data

One easy way to read ASCII data in Python is via pandas. Many different formats can be read, e.g. “csv”, “txt” or “xlsx”. Ideally, you have exactly one line for the header. Data formats can be very diverse; accordingly, there are many options you can set before reading the data. Make sure to choose the correct separator.

# import package
import pandas as pd

# open the data as dataframe 
df = pd.read_csv('beedata.csv')

# use head function to see the first 5 rows of the dataframe
df.head()
           time  hive        h   t_i_1  ...    t_i_4    t_i_5     t_o  weight_kg
0  1.565081e+18     4      NaN  27.250  ...  23.5000  23.2500  21.625        NaN
1  1.565081e+18     4      NaN  27.250  ...  23.5000  23.2500  21.625        NaN
2  1.565081e+18    13  59.0554  22.875  ...  21.8125  22.6875  22.875     10.779
3  1.565081e+18     4      NaN  27.250  ...  23.5625  23.2500  21.625        NaN
4  1.565081e+18     4      NaN  27.250  ...  23.5625  23.2500  21.625        NaN

[5 rows x 10 columns]
# use describe function to get an overview of the values of the dataframe
df.describe()
               time           hive  ...            t_o      weight_kg
count  1.712990e+05  171299.000000  ...  155142.000000  111106.000000
mean   1.565408e+18       8.382956  ...      18.709573      28.435138
std    1.668141e+14       4.523060  ...      31.252958      10.019183
min    1.565081e+18       3.000000  ...      -0.062500     -59.093070
25%    1.565285e+18       4.000000  ...      14.625000      24.154370
50%    1.565363e+18      12.000000  ...      18.250000      24.832320
75%    1.565592e+18      12.000000  ...      20.937500      39.548892
max    1.565686e+18      14.000000  ...    2072.500000      69.451870

[8 rows x 10 columns]

5.1.0.4 Transforming entire columns

Each column/ variable of a dataframe is a pandas Series, each with a specific data type. You can inspect the structure of a dataframe using the function “info”.

# open the data as dataframe 
df = pd.read_csv('beedata.csv')

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171299 entries, 0 to 171298
Data columns (total 10 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   time       171299 non-null  float64
 1   hive       171299 non-null  int64  
 2   h          95156 non-null   float64
 3   t_i_1      155193 non-null  float64
 4   t_i_2      155203 non-null  float64
 5   t_i_3      155232 non-null  float64
 6   t_i_4      155148 non-null  float64
 7   t_i_5      133220 non-null  float64
 8   t_o        155142 non-null  float64
 9   weight_kg  111106 non-null  float64
dtypes: float64(9), int64(1)
memory usage: 13.1 MB
# you can call each series via its header, e.g. 
df['weight_kg']
0              NaN
1              NaN
2         10.77900
3              NaN
4              NaN
            ...   
171294    24.31022
171295         NaN
171296    24.31170
171297    24.30817
171298         NaN
Name: weight_kg, Length: 171299, dtype: float64

E.g. converting kg to g:

df['weight_g'] = df['weight_kg']*1000.

df.head()
           time  hive        h   t_i_1  ...    t_i_5     t_o  weight_kg  weight_g
0  1.565081e+18     4      NaN  27.250  ...  23.2500  21.625        NaN       NaN
1  1.565081e+18     4      NaN  27.250  ...  23.2500  21.625        NaN       NaN
2  1.565081e+18    13  59.0554  22.875  ...  22.6875  22.875     10.779   10779.0
3  1.565081e+18     4      NaN  27.250  ...  23.2500  21.625        NaN       NaN
4  1.565081e+18     4      NaN  27.250  ...  23.2500  21.625        NaN       NaN

[5 rows x 11 columns]

Marking high temperature values:

df['highTemp'] = df['t_i_1']>25

Data can come in various date and time formats. Here, the timestamps are nanoseconds since 1970-01-01, so we convert them to proper datetime values:

df['time'] = pd.to_datetime(df['time'], unit='ns', origin=pd.Timestamp('1970-01-01'))

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171299 entries, 0 to 171298
Data columns (total 12 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   time       171299 non-null  datetime64[ns]
 1   hive       171299 non-null  int64         
 2   h          95156 non-null   float64       
 3   t_i_1      155193 non-null  float64       
 4   t_i_2      155203 non-null  float64       
 5   t_i_3      155232 non-null  float64       
 6   t_i_4      155148 non-null  float64       
 7   t_i_5      133220 non-null  float64       
 8   t_o        155142 non-null  float64       
 9   weight_kg  111106 non-null  float64       
 10  weight_g   111106 non-null  float64       
 11  highTemp   171299 non-null  bool          
dtypes: bool(1), datetime64[ns](1), float64(9), int64(1)
memory usage: 14.5 MB
df.head()
                 time  hive        h  ...  weight_kg  weight_g  highTemp
0 2019-08-06 08:50:18     4      NaN  ...        NaN       NaN      True
1 2019-08-06 08:50:28     4      NaN  ...        NaN       NaN      True
2 2019-08-06 08:50:31    13  59.0554  ...     10.779   10779.0     False
3 2019-08-06 08:50:38     4      NaN  ...        NaN       NaN      True
4 2019-08-06 08:50:48     4      NaN  ...        NaN       NaN      True

[5 rows x 12 columns]

Take care, sometimes you don’t get an error, but your data is screwed up anyway: sample Data
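As a self-contained sketch of this pitfall (using an inline CSV string instead of the course file): with the wrong separator, read_csv raises no error but crams everything into a single column.

```python
import io
import pandas as pd

data = "name;age;subject\nLi;18;chemistry\nRaphaela;23;psychology"

# wrong separator: no error, but a single garbled column
wrong = pd.read_csv(io.StringIO(data), sep=",")
print(wrong.shape)          # (2, 1)

# correct separator: three clean columns
right = pd.read_csv(io.StringIO(data), sep=";")
print(right.shape)          # (2, 3)
print(list(right.columns))  # ['name', 'age', 'subject']
```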


Use the head function to inspect your data.

5.2 Subsetting data

You can use square brackets and boolean expressions to choose lines and columns: df[<expression to choose lines>, <expression to choose columns>]

Select and plot data from hive 4:

df.4 <- df[df$hive==4,]
plot(df.4$time, df.4$t_o, ylim=c(0,40))

Select only the first 1000 lines:

df.4.sub <- df.4[1:1000,]
plot(df.4.sub$time, df.4.sub$t_o, ylim=c(0,40))

Delete columns / choose only some columns:

names(df)
 [1] "time"      "hive"      "h"         "t_i_1"     "t_i_2"     "t_i_3"    
 [7] "t_i_4"     "t_i_5"     "t_o"       "weight_kg" "weight_g"  "highTemp" 
df.some <- df[, c(1,2,10)]
time hive weight_kg
2019-08-06 10:50:18 4 NA
2019-08-06 10:50:28 4 NA
2019-08-06 10:50:31 13 10.779
2019-08-06 10:50:38 4 NA
2019-08-06 10:50:48 4 NA

You can also use:

df.same <- df[, c("time", "hive", "weight_kg")]
time hive weight_kg
2019-08-06 10:50:18 4 NA
2019-08-06 10:50:28 4 NA
2019-08-06 10:50:31 13 10.779
2019-08-06 10:50:38 4 NA
2019-08-06 10:50:48 4 NA

or

df.or <- df[, -c(3:9,11)]
time hive weight_kg highTemp
2019-08-06 10:50:18 4 NA TRUE
2019-08-06 10:50:28 4 NA TRUE
2019-08-06 10:50:31 13 10.779 FALSE
2019-08-06 10:50:38 4 NA TRUE
2019-08-06 10:50:48 4 NA TRUE

5.2.0.1 Useful functions

  • summary
  • head
  • tail
  • sum
  • mean

You can use square brackets and boolean expressions to choose lines and columns.

Select and plot data from hive 4:

df = pd.read_csv('beedata.csv')
df_4 = df[df['hive']==4]

import matplotlib.pyplot as plt

# quick plot with default arguments
plt.plot(df_4['time'],df_4['t_o'])

For more plotting options see https://matplotlib.org/stable/gallery/index.html

fig, ax = plt.subplots()

ax.plot(df_4['time'],df_4['t_o'],'x')

ax.set(ylim=(0, 40))
[(0.0, 40.0)]
plt.show()

Select only the first 1000 lines:

df_4_sub = df_4[1:1000]

# combine head and tail and show the first and last 2 rows
pd.concat([df_4_sub.head(2),df_4_sub.tail(2)])
              time  hive   h    t_i_1  ...    t_i_4    t_i_5      t_o  weight_kg
1     1.565081e+18     4 NaN  27.2500  ...  23.5000  23.2500  21.6250        NaN
3     1.565081e+18     4 NaN  27.2500  ...  23.5625  23.2500  21.6250        NaN
1145  1.565091e+18     4 NaN  31.5625  ...  29.0625  28.8125  27.3750        NaN
1146  1.565091e+18     4 NaN  31.6250  ...  29.1250  28.8125  27.3125        NaN

[4 rows x 10 columns]

Delete columns / choose only some columns:

df.columns
Index(['time', 'hive', 'h', 't_i_1', 't_i_2', 't_i_3', 't_i_4', 't_i_5', 't_o',
       'weight_kg'],
      dtype='object')
df_some = df[["time", "hive", "weight_kg"]] 
pd.concat([df_some.head(2),df_some.tail(2)])
                time  hive  weight_kg
0       1.565081e+18     4        NaN
1       1.565081e+18     4        NaN
171297  1.565686e+18    12   24.30817
171298  1.565686e+18     4        NaN

5.2.0.2 Useful functions

  • info
  • head
  • tail
  • describe
  • columns
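A quick sketch of these functions on a small hand-made dataframe (the values here are made up, not taken from the course data):

```python
import pandas as pd

# tiny example frame with one missing value
df = pd.DataFrame({'hive': [4, 4, 13],
                   'weight_kg': [None, 10.8, 10.779]})

print(df.head(2))     # first two rows
print(df.tail(1))     # last row
print(df.columns)     # column names
print(df.describe())  # summary statistics per numeric column
df.info()             # column types and non-null counts
```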

5.3 Exercises - Data

## Exercises about data types

# 1. Create following sequences: 
# 4, 8, 12, 16, 20, 24, 28
# 10, 8, 6, 4, 2, 0
# repeat "chocolate" 10 times in a vector


# 2. Starting with the following code, divide a by b and save it as variable c:

a <- "5555555"
b <- 3


# 3. Starting with following code, create the string/ character vector "group3trial1.csv"

group <- 3
trial <- 1


## Exercises about data frames

# 1. Read the data "allForarexData.csv"

# 2. Convert Exp0_OxygenTemp and Exp1_OxygenTemp. Divide by 1000. Save the result in the same column.

# 3. What are the means of Exp0_OxygenTemp and Exp1_OxygenTemp?


# 4. Save the timestamp in the correct format in UTC. The time was measured in seconds. (origin="1970-01-01")


# 5. When was the first measurement? When was the last measurement?

# 6. Create a subset containing only data measured after 2019-03-09 09:25:52.

# 7. Difficult: Exclude columns, where all measurements are the same. 
# hint at the bottom of the file


## Exercises about both data frames and data types

# Use beedata.csv

# 1. Select the lines of the bee data set where the hive number is 13 or 14.

# 2. What are the classes of the beedata's columns hive and t_i_1?

# 3. Difficult: How can you print the classes of all columns?







# Hint
# hint "Data types", 7: one option is to check if the min value of a column equals the max value. You can use a loop or following function: sapply(flight, function(x){min(x)!=max(x)})

## Solutions

## Exercises about data types

# 1. Create following sequences: 
# 4, 8, 12, 16, 20, 24, 28
# 10, 8, 6, 4, 2, 0
# repeat "chocolate" 10 times in a vector

seq(4,28,4)
seq(10,0,-2)
rep("chocolate", 10)


# 2. Starting with the following code...:

a <- "5555555"
b <- 3


c <- as.numeric(a) / b


# 3. Starting with following code, create the filename "group3trial1.csv"

group <- 3
trial <- 1

paste("group", group, "trial", trial, ".csv", sep="")


## Exercises about data frames

# 1. Read the data "allForarexData.csv"
flight <- read.table("allForarexData.csv", header = T, sep = ",")

# 2. Convert Exp0_OxygenTemp and Exp1_OxygenTemp. Divide by 1000. Save the result in the same column.
flight$Exp0_OxygenTemp <- flight$Exp0_OxygenTemp/1000
flight$Exp1_OxygenTemp <- flight$Exp1_OxygenTemp/1000

# 3. What are the means of Exp0_OxygenTemp and Exp1_OxygenTemp?
mean(flight$Exp0_OxygenTemp)
mean(flight$Exp1_OxygenTemp)

# 4. Save the timestamp in the correct format in UTC. The time was measured in seconds.
flight$timeStamp <- as.POSIXct(flight$timeStamp, origin="1970-01-01", tz = "UTC")

# 5. When was the first measurement? When was the last measurement?
min(flight$timeStamp)
max(flight$timeStamp)

# 6. Create a subset containing only data measured after 2019-03-09 09:25:52.

flight.subtime <- flight[flight$timeStamp>"2019-03-09 09:25:52",]

# 7. Difficult: Exclude columns, where all measurements are the same. 

# hint: one option is using: sapply(flight, function(x){min(x)!=max(x)})

ncol(flight)
flight.sub <- flight[, sapply(flight, function(x){min(x)!=max(x)})]
ncol(flight.sub)

## Exercises about both data frames and data types

# Use beedata.csv

# 1. Select the lines of the bee data set where the hive number is 13 or 14.

df.13or14 <- df[df$hive==13|df$hive==14,]


# 2. What are the classes of the beedata's columns hive and t_i_1?

class(df$hive)
class(df$t_i_1)

# 3. Difficult: How can you print the classes of all columns?

lapply(df, class)

5.4 Conditions - If and Else

5.4.0.1 If

Simple if:

a <- 19
if(a >= 18){
  print("Yes, you are allowed to see this content.")
}
[1] "Yes, you are allowed to see this content."

if and else:

goodWeather <- TRUE
if (goodWeather){
  print("Let's go to the beach!")
} else{
  print("Let's eat pizza.")
}
[1] "Let's go to the beach!"

Else-if:

do.you.want.this <- "chocolate"
#do.you.want.this <- "cookies"
#do.you.want.this <- "carrots"
if (do.you.want.this == "chocolate"){
  print("Yes, I love chocolate!")
} else if (do.you.want.this == "cookies"){
  print("Yes, if they are with chocolate.")
} else {
  print("Hm, not sure. Do you have chocolate?")
}
[1] "Yes, I love chocolate!"
library(lubridate)
birthday <- as.POSIXct("2019-08-06 10:50:18")
what.do.you.want.to.know <- "year"
#what.do.you.want.to.know <- "month"
#what.do.you.want.to.know <- "chocolate"
if (what.do.you.want.to.know == "year"){
  print(year(birthday))
} else if (what.do.you.want.to.know == "month"){
  print(month(birthday))
} else {
  print("Sorry, what do you want to know?")
}
[1] 2019

5.4.0.2 If

Note: if, else, etc. do not use (), {} or [] to mark their beginnings and endings. Python uses indentation.

Simple if:

a = 19
if a >= 18:
    print("Yes, you are allowed to see this content.")
Yes, you are allowed to see this content.

if and else:

goodWeather = True
if goodWeather:
    print("Let's go to the beach!")
else:
    print("Let's eat pizza.")
Let's go to the beach!

elif (short for else if):

do_you_want_this = "chocolate"
#do_you_want_this = "cookies"
#do_you_want_this = "carrots"
if do_you_want_this == "chocolate":
    print("Yes, I love chocolate!")
elif do_you_want_this == "cookies":
    print("Yes, if they are with chocolate.")
else:
    print("Hm, not sure. Do you have chocolate?")
Yes, I love chocolate!
from datetime import datetime
birthday = datetime.fromisoformat('2019-08-06 10:50:18')

what_do_you_want_to_know = "year"
#what_do_you_want_to_know = "month"
#what_do_you_want_to_know = "chocolate"
if what_do_you_want_to_know == "year":
    print(birthday.year)
elif what_do_you_want_to_know == "month":
    print(birthday.month)
else:
    print("Sorry, what do you want to know?")
2019

5.5 For and While

5.5.0.1 For

ages <- c(31, 29, 5)
total.age <- 0
for(i in ages){
  total.age <- total.age + i
}
total.age
[1] 65
myfamily <- list(Paula=31, Ali=29, Emma=5)
age <- 0
for (i in 1:length(myfamily)){
  age <- age + myfamily[[i]]
}
age
[1] 65

apply it to columns:

for (i in 1:ncol(df)){
  name <- names(df)[i]
  colclass <- class(df[,i])
  print(paste(name, ":", colclass))
}
[1] "time : POSIXct" "time : POSIXt" 
[1] "hive : integer"
[1] "h : numeric"
[1] "t_i_1 : numeric"
[1] "t_i_2 : numeric"
[1] "t_i_3 : numeric"
[1] "t_i_4 : numeric"
[1] "t_i_5 : numeric"
[1] "t_o : numeric"
[1] "weight_kg : numeric"
[1] "weight_g : numeric"
[1] "highTemp : logical"

transform long code …

df <- read.csv("beedata.csv")
print("mean of h:")
mean(df$h)
print("mean of t_i_1:")
mean(df$t_i_1)
print("mean of weight_kg:")
mean(df$weight_kg)
print("missing h values:")
sum(is.na(df$h))
print("missing t_i_1 values:")
sum(is.na(df$t_i_1))
print("missing weight_kg values:")
sum(is.na(df$weight_kg))

… into shorter code

to.analyse <- c("h", "t_i_1", "weight_kg")
for(i in 1:length(to.analyse)){
  print(paste("mean of", to.analyse[i], ":"))
  print(mean(df[,to.analyse[i]]))
  print(paste("missing", to.analyse[i], "values:"))
  print(sum(is.na(df[,to.analyse[i]])))
}

5.5.0.2 While

a <- 2
while (a<1000){
  a <- a^2
  print(a)
}
[1] 4
[1] 16
[1] 256
[1] 65536

5.5.0.3 Do-while

a <- 10
repeat{
  a <- a - sqrt(a)
  print(a)
  if(sqrt(a) > a){
    break
  }
}
[1] 6.837722
[1] 4.222818
[1] 2.167869
[1] 0.6955003

bad example for Do-while:

a <- 10
repeat{
  a <- a - sqrt(a)
  print(a)
  if(a < 0){
    break
  }
}
[1] 6.837722
[1] 4.222818
[1] 2.167869
[1] 0.6955003
[1] -0.1384663

because it breaks for edge cases:

a <- -10
repeat{
  a <- a - sqrt(a)
  print(a)
  if(a < 0){
    break
  }
}
Warning in sqrt(a): NaNs produced
[1] NaN
Error in if (a < 0) {: missing value where TRUE/FALSE needed

5.5.0.4 For vs. While

  • if you do not know the number of iterations needed, use while
  • otherwise, always use for!
  • it is easier to accidentally create a while-loop that never stops than a for-loop that never stops.

5.5.0.5 For

Note: Python does not use (), {} or [] to mark the beginning and end of for, while, etc. Instead, Python uses indentation.

ages = [31, 29, 5]
total_age = 0

for i in ages:
    total_age = total_age + i
print(total_age)
65

same as:

ages = [31, 29, 5]
total_age = 0

for age in ages:
    total_age = total_age + age
print(total_age)
65

The next example uses a dictionary with {key:value} structure.

myfamily = {'Paula':31, 'Ali':29, 'Emma':5}
age = 0
for key, value in myfamily.items():
    age = age + value
print(age)
65

apply it to dataframes:

import pandas as pd

df = pd.read_csv('beedata.csv')

for ind,value in df['t_i_1'].items():
    if ind > 6 and ind <= 10:
        print('ind = ',ind)
        print('value = ',value)
ind =  7
value =  27.25
ind =  8
value =  27.25
ind =  9
value =  27.25
ind =  10
value =  22.6875

transform long code …

import pandas as pd

df = pd.read_csv('beedata.csv')

print("mean of h:")
mean of h:
df['h'].mean()
np.float64(56.397063812371265)
print("mean of t_i_1:")
mean of t_i_1:
df['t_i_1'].mean()
np.float64(25.847728319382963)
print("mean of weight_kg:")
mean of weight_kg:
df['weight_kg'].mean()
np.float64(28.43513778723021)
print("missing h values:")
missing h values:
df['h'].isna().sum()
np.int64(76143)
print("missing t_i_1 values:")
missing t_i_1 values:
df['t_i_1'].isna().sum()
np.int64(16106)
print("missing weight_kg values:")
missing weight_kg values:
df['weight_kg'].isna().sum()
np.int64(60193)

… into shorter code

to_analyse = ["h", "t_i_1", "weight_kg"]
for i in to_analyse:
    print("mean of", i, df[i].mean())
    print("sum of missing values of", i, df[i].isna().sum())
mean of h 56.397063812371265
sum of missing values of h 76143
mean of t_i_1 25.847728319382963
sum of missing values of t_i_1 16106
mean of weight_kg 28.43513778723021
sum of missing values of weight_kg 60193

5.5.0.6 While

a = 2
while (a<1000):
    a = a**2
    print(a)
4
16
256
65536

5.5.0.7 Do-while

import math

a = 10
while True:
    a = a - math.sqrt(a)
    print(a)
    if math.sqrt(a) > a:
        break
6.83772233983162
4.222818452528758
2.167868708002445
0.6955003070910275

bad example for Do-while:

import math

a = 10
while True:
    a = a - math.sqrt(a)
    print(a)
    if a < 0:
        break
6.83772233983162
4.222818452528758
2.167868708002445
0.6955003070910275
-0.13846630320642772

because it breaks for edge cases:

import math

a = -10
while True:
    try:
        a = a - math.sqrt(a)
    except ValueError:
        print("math domain error: sqrt(a) with negative a is not defined")
        break
    print(a)
    if a < 0:
        break
math domain error: sqrt(a) with negative a is not defined

5.5.0.8 For vs. While

  • if you do not know the number of iterations needed, use while
  • otherwise, always use for!
  • it is easier to accidentally create a while-loop that never stops than a for-loop that never stops.
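To guard against a while-loop that never terminates, a common pattern is to cap the number of iterations with a safety counter. This is a sketch; the bound `max_iter` is a hypothetical value, not part of the course material:

```python
# Sketch: cap the number of while-loop iterations with a safety counter
a = 2
max_iter = 100  # hypothetical upper bound to avoid an endless loop
count = 0
while a < 1000 and count < max_iter:
    a = a ** 2
    count += 1
print(a)  # 65536, reached after 4 iterations
```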

5.6 Exercises - Conditions, For and While

# 1. Imagine somebody tells you the age of two of her friends,
# Ali and Bea. If Ali is older than Bea print 
# "Ali is older than Bea.". If Bea is older than Ali,
# print "Bea is older than Ali.".

Ali <- 5555/111
Bea <- 10*5




## For and While

#  1. Loop from 1 to 15 and print the square-root of 
# each number

# 2. The function "rbinom(1,1,0.9)" models a very unfair
# coin, which shows the head (1) with a probability 
# of 90 %. Flip this coin until you get tail (0). 
# Count how many times you flipped the coin.

# 3. Repeat each character of the string "Summer" 3 times.
# Combine all characters to a new string. 

# hint at the bottom of this file



# 4. Difficult: Read in all files in the directory 
# "forarexData" and combine them to a single data frame.

# hint at the bottom of this file






# Hints


# Watch out, spoiler alert!

# hint "For and While", 3:  Check out the functions "substr" and "nchar".





# hint "For and While", 4:  Useful functions: list.files, try or try-catch, rbind



# hint "Functions" 1: in case you did not solve the earlier string task, here is its solution:

# Watch out, this is the solution to a previous task!


s <- "summer"
s1 <- ""
for(i in 1:nchar(s)){
  for(j in 1:3){
    s1 <- paste(s1, substr(s, i, i), sep = "")
  }
}
s1
## Exercises - Conditions

# 1. Imagine somebody tells you the age of two of her friends,
# Ali and Bea. If Ali is older than Bea print 
# "Ali is older than Bea.". If Bea is older than Ali,
# print "Bea is older than Ali.".

Ali <- 5555/111
Bea <- 10*5

if (Ali>Bea){
  print("Ali is older than Bea.")
} else{
  print("Bea is older than Ali.")
}

# 2. The price of a small bar of chocolate is 1 Euro. 
# For a medium bar of chocolate you have to pay 3 Euros 
# and for a large bar you have to pay 5 Euros.
# Somebody tells you how much money she has. 
# Tell her, how much chocolate she can buy 
# (Eg. print "You can buy a small bar of chocolate.")
# She does not want to buy more than one bar of chocolate, 
# but the largest possible.

money <- 14/3

if (money >= 5){
  print("You can buy a large bar of chocolate.")
} else if (money >= 3){
  print("You can buy a medium bar of chocolate.")
} else if (money >= 1){
  print("You can buy a small bar of chocolate.")
} else{
  print("Sorry, you can't buy any chocolate.")
}

## For and While

#  1. Loop from 1 to 15 and print the square-root of 
# each number

for (i in 1:15){
  print(sqrt(i))
}

# 2. The function "rbinom(1,1,0.9)" models a very unfair
# coin, which shows the head (1) with a probability 
# of 90 %. Flip this coin until you get tail (0). 
# Count how many times you flipped the coin.

coin <- 1
index <- 0
while (coin!=0){
  coin <- rbinom(1,1,0.9)
  index <- index +1
}
index

# or

index <- 0
repeat{
  coin <- rbinom(1,1,0.9)
  index <- index +1
  if(coin==0){
    break
  }
}
index

# 3. Repeat each character of the string "Summer" 3 times.
# Combine all characters to a new string. 

# hint:  Check out the functions "substr" and "nchar".

s <- "summer"
s1 <- ""
for(i in 1:nchar(s)){
  for(j in 1:3){
    s1 <- paste(s1, substr(s, i, i), sep = "")
  }
}
s1


# 4. Difficult: Read in all files in the directory 
# "forarexData" and combine them to a single data frame.

# hint:  Useful functions: list.files, try or try-catch, rbind

dir <- "forarexData/"
files <- list.files(dir)
files <- paste(dir, files, sep="")

df<-data.frame()

for (i in 1:length(files)) {
  tmp <- read.csv(file = files[i])
  df <- rbind(df,tmp)
}
# 1. Imagine somebody tells you the age of two of her friends,
# Ali and Bea. If Ali is older than Bea, print 
# "Ali is older than Bea.". If Bea is older than Ali,
# print "Bea is older than Ali.".

Ali = 5555/111
Bea = 10*5


# 2. The price of a small bar of chocolate is 1 Euro. 
# For a medium bar of chocolate you have to pay 3 Euros 
# and for a large bar you have to pay 5 Euros.
# Somebody tells you how much money she has. 
# Tell her, how much chocolate she can buy 
# (Eg. print "You can buy a small bar of chocolate.")
# She does not want to buy more than one bar of chocolate, 
# but the largest possible.

money = 14/3


## For and While

# 1. Loop from 1 to 15 and print the square-root of 
# each number

# hint: you need the package numpy for square-roots
import numpy as np

# 2. The function "np.random.binomial(n=1, p=0.9, size=1)" models a very unfair
# coin, which shows the head (1) with a probability 
# of 90 %. Flip this coin until you get tail (0). 
# Count how many times you flipped the coin.

# 3. Repeat each character of the string "Summer" 3 times.
# Combine all characters to a new string. 

# hint at the bottom of this file



# 4. Difficult: Read in all files in the directory 
# "forarexData" and combine them to a single data frame.

# hint at the bottom of this file













# Hints


# Watch out, spoiler alert!

# hint "For and While", 3:  The function "len" works also for strings.





# hint "For and While", 4:  useful functions: os.listdir, os.path.join(), pd.concat

import os
import pandas as pd
## Exercises - Conditions

# 1. Imagine somebody tells you the age of two of her friends,
# Ali and Bea. If Ali is older than Bea, print 
# "Ali is older than Bea.". If Bea is older than Ali,
# print "Bea is older than Ali.".

Ali = 5555/111
Bea = 10*5

if Ali > Bea:
    print("Ali is older than Bea.")
else:
    print("Bea is older than Ali.")


# 2. The price of a small bar of chocolate is 1 Euro. 
# For a medium bar of chocolate you have to pay 3 Euros 
# and for a large bar you have to pay 5 Euros.
# Somebody tells you how much money she has. 
# Tell her, how much chocolate she can buy 
# (Eg. print "You can buy a small bar of chocolate.")
# She does not want to buy more than one bar of chocolate, 
# but the largest possible.

money = 14/3

if money >= 5:
    print("You can buy a large bar of chocolate.")
elif money >= 3:
    print("You can buy a medium bar of chocolate.")
elif money >= 1:
    print("You can buy a small bar of chocolate.")
else:
    print("Sorry, you can't buy any chocolate.")


## For and While

# 1. Loop from 1 to 15 and print the square-root of 
# each number

# you need the package numpy to compute square-roots
import numpy as np

for i in range(1,16):
    print(np.sqrt(i))

    
# 2. The function "np.random.binomial(n=1, p=0.9, size=1)" models a very unfair
# coin, which shows the head (1) with a probability 
# of 90 %. Flip this coin until you get tail (0). 
# Count how many times you flipped the coin.

coin = 1
index = 0

while coin!=0:
    coin = np.random.binomial(n=1, p=0.9, size=1)
    index = index +1

print(index)



# 3. Repeat each character of the string "Summer" 3 times.
# Combine all characters to a new string. 

s = "summer"
s1 = ""

for i in range(len(s)):
    for j in range(3):
        s1 += s[i]

print(s1)


# 4. Difficult: Read in all files in the directory 
# "forarexData" and combine them to a single data frame.

# hint:  Useful functions: os.listdir, os.path.join(), pd.concat

import os
import pandas as pd

dir = "forarexData/"
files = os.listdir(dir)
files = [os.path.join(dir, f) for f in files]

df = pd.DataFrame()
for file in files:
    tmp = pd.read_csv(file)
    df = pd.concat([df, tmp], ignore_index=True)

print(df)

5.7 Writing own functions

The basic syntax to write functions looks like this:

myPlus <- function(a, b){
  return(a + b)
}
myPlus(1,2)
[1] 3

You can use default parameters:

myPlus <- function(a=1, b=1){
  return(a + b)
}
myPlus(1,2)
[1] 3
myPlus()
[1] 2

If you want to return multiple values, you can use a named list:

dataframe.info <- function(df){
  cells.count <- ncol(df)*nrow(df)
  return(list(columns=ncol(df), rows=nrow(df), cells= cells.count))
}
dataframe.info(df)
$columns
[1] 12

$rows
[1] 171299

$cells
[1] 2055588

The basic syntax to write functions looks like this:

def myPlus(a, b):
    c = a+b
    return c
    
myPlus(1,2)
3

You can use default parameters:

def myPlus(a=1, b=1):
    c = a+b
    return c
    
myPlus(1,2)
3
myPlus()
2

Example to return multiple values:

def dataframe_info(df):

    rows_count = df.shape[0]
    columns_count = df.shape[1]
    values_count = df.shape[0]*df.shape[1]

    return columns_count, rows_count, values_count

import pandas as pd
df = pd.read_csv('beedata.csv')

dataframe_info(df)
(10, 171299, 1712990)
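In Python you can also return the values with names, analogous to R's named list, by using a dictionary. This is a sketch; the function name `dataframe_info_dict` and the small stand-in dataframe are illustrative, not part of the course material:

```python
import pandas as pd

def dataframe_info_dict(df):
    # return named values in a dict, analogous to R's named list
    rows, cols = df.shape
    return {"columns": cols, "rows": rows, "cells": rows * cols}

# small example dataframe (hypothetical, stands in for beedata.csv)
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(dataframe_info_dict(df))  # {'columns': 2, 'rows': 3, 'cells': 6}
```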

5.8 Exercises: Functions

# 1. In one of the previous tasks you wrote some code to
# Repeat each character of the string "Summer" 3 times.
# Convert it into a function. It should take two parameters:
# - a string, e.g. “summer”.
# - A number, e.g. 3 to specify how often each letter 
#   should be repeated.
# Use “summer” and 3 as default values.

# hint at the bottom of this file


# 2. We look at the data from the beehives again. 
# Given a beehive, how many observations do we have
# for that beehive? How many observations are missing
# in the column “weight_kg”? Write a function for this
# problem. Return both values in a named list.
# Use the dataframe and the hive number as input parameters.

# 3. Difficult-follow-up: In a previous exercise, you wrote some code to read
# in all files within a directory. 
# Convert the code to a function.








# Hints
# hint "Functions" 1: in case you did not solve the earlier string task, here is its solution:

# Watch out, this is the solution to a previous task!


s <- "summer"
s1 <- ""
for(i in 1:nchar(s)){
  for(j in 1:3){
    s1 <- paste(s1, substr(s, i, i), sep = "")
  }
}
s1
# 1. In one of the previous tasks you wrote some code to
# Repeat each character of the string "Summer" 3 times.
# Convert it into a function. It should take two parameters:
# - a string, e.g. “summer”.
# - A number, e.g. 3 to specify how often each letter 
#   should be repeated.
# Use “summer” and 3 as default values.
# hint: in case you did not solve the earlier string task, here is its solution:

# Watch out, this is the solution to a previous task!


s <- "summer"
s1 <- ""
for(i in 1:nchar(s)){
  for(j in 1:3){
    s1 <- paste(s1, substr(s, i, i), sep = "")
  }
}
s1

funny.words <- function(word="summer", repeat.times=3) {
  s1 <- ""
  for(i in 1:nchar(word)){
    for(j in 1:repeat.times){
      s1 <- paste(s1, substr(word, i, i), sep = "")
    }
  }
  return(s1)
}

# 2. We look at the data from the beehives again. 
# Given a beehive, how many observations do we have
# for that beehive? How many observations are missing
# in the column “weight_kg”? Write a function for this
# problem. Return both values in a named list.
# Use the dataframe and the hive number as input parameters.

df <- read.csv("beedata.csv")

beehive.info <- function(df, hive){
  sub <- df[df$hive==hive,]
  n <- nrow(sub)
  missing <- sum(is.na(sub$weight_kg))
  return(list(n=n, missing=missing))
}
beehive.info(df, 4)
beehive.info(df, 13)

# 3. In a previous exercise, you wrote some code to read
# in all files within a directory. 
# Convert the code to a function.


combine <- function(dir="forarexData/", finalName="forarexCombined.csv"){
  # Purpose: combine all files in a directory and save them to a new csv
  # Parameters:
  #  dir: the directory of the files to be combined
  #  finalName: the name of the file to be used to save the csv
  
  # list all files in directory
  files <- list.files(dir)
  files <- paste(dir, files, sep="")
  
  df<-data.frame()
  
  for (i in 1:length(files)) {
    tmp <- read.csv(file = files[i])
    df <- rbind(df,tmp)
  }
  write.csv(df, finalName)
}

combine()

### Other option with try

combine.try <- function(dir="forarexData/", finalName="forarexCombined.csv"){
  # Purpose: combine all files in a directory and save them to a new csv
  # Parameters:
  #  dir: the directory of the files to be combined
  #  finalName: the name of the file to be used to save the csv
  
  # list all files in directory
  files <- list.files(dir)
  files <- paste(dir, files, sep="")
  
  df<-data.frame()
  
  for (i in 1:length(files)) {
    try({
      tmp <- read.csv(file = files[i])
      df <- rbind(df,tmp)
    })
  }
  write.csv(df, finalName)
}
combine.try()

# option with tryCatch, in case some files are completely malformed

combine.trycatch <- function(dir="forarexData/", finalName="forarexCombined.csv"){
  # Purpose: combine all files in a directory and save them to a new csv
  # Parameters:
  #  dir: the directory of the files to be combined
  #  finalName: the name of the file to be used to save the csv
  
  # list all files in directory
  files <- list.files(dir)
  files <- paste(dir, files, sep="")
  
  df<-data.frame()
  
  for (i in 1:length(files)) {
    tryCatch({
      tmp <- read.csv(file = files[i])
      df <- rbind(df,tmp)
    },
    warning = function(warning_condition){
      print(paste("Some weird formatting in file", files[i]))
      print(warning_condition)
    },
    # handling of errors, in case there was some wrong formatting
    error = function(error_condition){
      print(paste("Skipping file", files[i]))
      print(error_condition)
    })
  }
  write.csv(df, finalName)
}
combine.trycatch()
# 1. In one of the previous tasks you wrote some code to
# repeat each character of the string "Summer" 3 times.
# Convert it into a function. It should take two parameters:
# - a string, e.g. “summer”.
# - A number, e.g. 3 to specify how often each letter 
#   should be repeated.
# Use “summer” and 3 as default values.

# hint at the bottom of this file


# 2. We look at the data from the beehives again. 
# Given a beehive, how many observations do we have
# for that beehive? How many observations are missing
# in the column “weight_kg”? Write a function for this
# problem. Return both values in a named list.
# Use the dataframe and the hive number as input parameters.

# 3. Difficult-follow-up: In a previous exercise, you wrote some code to read
# in all files within a directory. 
# Convert the code to a function.








# Hints
# hint "Functions" 1: in case you did not solve the earlier string task, here is its solution:

# Watch out, this is the solution to a previous task!


s = "summer"
s1 = ""

for i in range(len(s)):
    for j in range(3):
        s1 += s[i]

print(s1)
sssuuummmmmmeeerrr
# 1. In one of the previous tasks you wrote some code to
# Repeat each character of the string "Summer" 3 times.
# Convert it into a function. It should take two parameters:
# - a string, e.g. “summer”.
# - A number, e.g. 3 to specify how often each letter 
#   should be repeated.
# Use “summer” and 3 as default values.
# hint: in case you did not solve the earlier string task, here is its solution:

# Watch out, this is the solution to a previous task!

s = "summer"
s1 = ""

for i in range(len(s)):
    for j in range(3):
        s1 += s[i]

print(s1)


def funny_words(word="summer", repeat_times=3):
    s1 = ""
    for i in range(len(word)):
        for j in range(repeat_times):
            s1 += word[i]
    return s1



# 2. We look at the data from the beehives again. 
# Given a beehive, how many observations do we have
# for that beehive? How many observations are missing
# in the column “weight_kg”? Write a function for this
# problem. Return both values in a named list.
# Use the dataframe and the hive number as input parameters.

import pandas as pd

df = pd.read_csv("beedata.csv")

def beehive_info(df, hive):
    sub = df[df['hive'] == hive]
    n = len(sub)
    missing = sub['weight_kg'].isna().sum()
    return {'n': n, 'missing': missing}

# examples
result_4 = beehive_info(df, 4)
result_13 = beehive_info(df, 13)

print(result_4)
print(result_13)




# 3. In a previous exercise, you wrote some code to read
# in all files within a directory. 
# Convert the code to a function.


import os
import pandas as pd

def combine(dir="forarexData/", final_name="forarexCombined.csv"):
    # list all files in directory
    files = os.listdir(dir)
    files = [os.path.join(dir, f) for f in files]
    
    # create a new, empty dataframe
    df = pd.DataFrame()
    for file in files:
        tmp = pd.read_csv(file)
        df = pd.concat([df, tmp], ignore_index=True)
    
    df.to_csv(final_name, index=False)

# call function
combine()

5.9 Debugging

Here is a buggy version of the function “funny.words”:

funny.words <- function(s = "summer", count = 3){
  s = "summer"
  s1 <- ""
  for(i in 1:2){
    for(i in 1:count){
      print(substr(s, i, i))
      s1 <- paste(s1, substr(s, i, i), sep = "")
    }
  }
  return(s1)
}
funny.words()
[1] "s"
[1] "u"
[1] "m"
[1] "s"
[1] "u"
[1] "m"
[1] "sumsum"
funny.words("banana", 5)
[1] "s"
[1] "u"
[1] "m"
[1] "m"
[1] "e"
[1] "s"
[1] "u"
[1] "m"
[1] "m"
[1] "e"
[1] "summesumme"

Since this code consists of a function with loops, we cannot simply run it line by line to find the mistakes. However, we can create breakpoints by clicking next to the line numbering (left).

Then we can click on “Source” to execute all code up to the first breakpoint.

In the console, there will be a new set of buttons:

  • If you click on Next, R will execute the next line. (e.g. line 6)
  • If you click on the second button, R will step into the current function call, i.e. jump into another function. (e.g. into the print function)
  • If you click on the third button, R will execute the rest of the current function or loop. (e.g. lines 6 and 7)
  • If you click on “Continue”, R will run until it reaches the next breakpoint. (e.g. in the next round of the loop or in line 10)
  • If you click on “Stop”, R will exit the debug mode.

Here is a buggy version of the function “funny_words”. We are using the package ipdb for debugging.
Insert ipdb.set_trace() into the function to set a breakpoint inside the loop. You can then step through the code and print out variables.


# import ipdb

def funny_words(s = "summer", count = 3):
    s = "summer"
    s1 = ""
    
    for i in range(2):
        for i in range(count):
            # ipdb.set_trace()  # Pause the execution to inspect 's', 'i' and 'i'
            print(s, i, i)

    s1 =[s, i, i] 

    return s1

try out:

funny_words()
summer 0 0
summer 1 1
summer 2 2
summer 0 0
summer 1 1
summer 2 2
['summer', 2, 2]

and

funny_words("banana", 5)
summer 0 0
summer 1 1
summer 2 2
summer 3 3
summer 4 4
summer 0 0
summer 1 1
summer 2 2
summer 3 3
summer 4 4
['summer', 4, 4]

You can use the following debugging commands, e.g. to print variables or to continue the loop until the next breakpoint:
- p: print the value of a variable
- c: continue to the next breakpoint
- n: move to the next line of code
- q: quit the debugging session

5.10 Piping

Some people complain about all the nested brackets in R syntax. Using pipes, you can get rid of them. Pipes may also reflect your workflow more naturally than nested function calls.

You can find the standard pipe operator in multiple packages, but you will get more options when using the magrittr package. Since R 4.1, base R also provides a native pipe operator |>.

Here is a simple example:

library(magrittr)
x <- 9
# Calculate the square-root of x
sqrt(x)
[1] 3
# Calculate it using pipes
x %>% sqrt
[1] 3

In case you want to update x, you can use:

x <- 9
# Calculate the square-root of x and update x
x <- sqrt(x)
x
[1] 3
# Calculate it using pipes
x <- 9
x %<>% sqrt
x
[1] 3

This is an example with two functions and additional arguments:

nrow(subset(df, hive==4))
[1] 60193
df %>% subset(hive==4) %>% nrow
[1] 60193

There is no built-in pipe operator in Python, but method chaining (for example on pandas dataframes) plays a similar role.
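That said, method chaining on pandas dataframes reads left to right much like a pipe. Here is a sketch; the dataframe and its column names are illustrative stand-ins, not course data:

```python
import pandas as pd

# small stand-in dataframe (illustrative column names)
df = pd.DataFrame({"hive": [4, 4, 7], "weight_kg": [28.1, 28.3, 30.0]})

# nested style, analogous to nrow(subset(df, hive==4)) in R
n_nested = len(df[df["hive"] == 4])

# method chaining reads left to right, similar to a pipe
n_chained = df.query("hive == 4").shape[0]

print(n_nested, n_chained)  # 2 2
```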

6 PANGAEA - downloading data

An example script to download data from PANGAEA via the toolbox PANGAEAr is in github.

An example script to download data from PANGAEA via the toolbox PANGAEApy is in github.

7 Data Cleaning

An example script to clean up data is in github.

An example script to clean up data is in github.

8 Statistics

8.1 First Step: Look at your data (Descriptive Statistics)

  • What is your response variable?
  • Plot your data
    • Plot raw data
    • Frequency distribution -> Are there any patterns?

8.2 Definition - Simple Linear Regression

\[ \begin{equation} y_i = \beta_0 + \beta_1x_i + \epsilon_i, \qquad i=1,...,n \end{equation} \] where \(y\) is the dependent variable, \(x\) is the independent variable (also called explanatory variable), \(\beta_0\) is the intercept parameter, \(\beta_1\) is the slope parameter and \(\epsilon\sim N(0,\sigma^2)\) is the error term.
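As a sketch, this model can also be simulated and fitted in Python with numpy, using ordinary least squares via `numpy.linalg.lstsq`. The true parameters here are chosen as \(\beta_0 = 0\) and \(\beta_1 = 1\):

```python
import numpy as np

rng = np.random.default_rng(123)
x = np.linspace(0, 5, 51)            # grid from 0 to 5 in steps of 0.1
y = x + rng.standard_normal(len(x))  # beta0 = 0, beta1 = 1, eps ~ N(0, 1)

# design matrix with an intercept column, fitted by least squares
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # estimates of (beta0, beta1), close to (0, 1)
```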

8.3 Sample data

set.seed(123)
x <- seq(0,5,0.1) 
y <- x + rnorm(length(x))
plot(x, y)

hist(y)

8.4 Fitting a Linear Model

mod <- lm(y~x)
summary(mod)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0116 -0.6110 -0.0912  0.6575  2.1444 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.05829    0.25565   0.228    0.821    
x            0.99216    0.08812  11.259  3.4e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9263 on 49 degrees of freedom
Multiple R-squared:  0.7212,    Adjusted R-squared:  0.7155 
F-statistic: 126.8 on 1 and 49 DF,  p-value: 3.397e-15
plot(x,y)

#abline(mod)
library(ggplot2)
ggplot(mapping= aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

8.5 Overall check for assumptions

plot(mod)

Produces series of graphs (first two are the most important)

8.5.1 1st graph: Constancy of variance

Plot of the residuals against the fitted values → should look like the sky at night (no visible pattern)

8.5.2 2nd graph: Normality of errors

Normal QQ plot: a plot of the ordered residuals against the theoretical quantiles of a normal distribution → should be a reasonably straight line

8.6 Linear Model - Assumptions

  • Linear relationship between x and y
  • Independence of residuals
  • Constant variance of residuals
  • Normality: Residuals are normally distributed

8.7 Assumption 1 - Linear Relationship

  • use a scatterplot to check
  • if this assumption is violated we have unrealistic predictions and estimates of the standard deviation
  • How to deal with this? You can transform the variables or alter the model.
plot(x,y)

8.8 Assumption 2 - Independence of Residuals

  • Plot residuals in time/ observation order / based on observation location / …
  • You expect that they are equally scattered around 0
  • You can also use the Durbin-Watson test
  • Autocorrelation can have serious effects on the model
  • You can add an AR correlation term to the model. Sometimes it also helps to add a further covariate.
res <- residuals(mod)
plot(x, res)

8.9 Durbin-Watson Test

  • H0: There is no correlation among the residuals.

  • HA: The residuals are autocorrelated.

library(car)
Loading required package: carData
durbinWatsonTest(mod)
 lag Autocorrelation D-W Statistic p-value
   1      0.02431999      1.940949   0.712
 Alternative hypothesis: rho != 0
  • The p-value is larger than 0.05, so we cannot reject the null hypothesis that there is no correlation between residuals.
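The Durbin-Watson statistic itself is straightforward to compute by hand. Here is a sketch in Python with numpy on simulated independent residuals (values near 2 indicate no lag-1 autocorrelation):

```python
import numpy as np

def durbin_watson(res):
    # DW = sum of squared first differences / sum of squared residuals
    res = np.asarray(res)
    return np.sum(np.diff(res) ** 2) / np.sum(res ** 2)

rng = np.random.default_rng(0)
res = rng.standard_normal(200)  # independent residuals
print(durbin_watson(res))       # close to 2 for uncorrelated residuals
```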

8.10 Assumption 3 - Constant variance of errors

  • Plot fitted values vs residuals
  • You expect that they are equally scattered around 0
plot(fitted(mod),res)

8.11 Assumption 4 - Normality of Residuals

  • Use qq-plot to compare quantiles.
  • Or use the Shapiro-Wilk-Test
  • If you do not find a Normal Distribution, check for outliers or transform your variable.
qqPlot(res)

[1] 44 18

8.12 Shapiro-Wilk-Test for Normality

  • H0: The data is normally distributed
  • HA: The data comes from another distribution
shapiro.test(res)

    Shapiro-Wilk normality test

data:  res
W = 0.99152, p-value = 0.9732
  • We cannot reject the null-hypothesis that the data is normally distributed.

8.13 If the model doesn’t seem to fit your data:

8.13.1 Is your response really continuous?

If not, another model might be more appropriate:

  • Continuous: linear model → LM, e.g. height, weight
  • Count: Poisson model → GLM, e.g. number of individuals
  • Binary: Binomial model → GLM, e.g. presence / absence

8.13.2 Are there obvious patterns?

For example, many zeros: you could separate the problem into two analyses, a binomial model for success / failure and a linear model for the successes.

8.13.3 Are there outliers that should be omitted?

4th graph of plot(model): Plot of Cook's distances versus raw data → Highlights the identity of particularly influential data points

8.13.4 Relationship not a straight line?

Polynomial regression e.g. a quadratic function might be an option

 lm(y ~ x + I(x^2))
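In Python, a quadratic polynomial can be fitted with `numpy.polyfit`. This is a sketch on noise-free data, so the true coefficients are recovered essentially exactly:

```python
import numpy as np

x = np.linspace(0, 5, 51)
y = 2.0 + 0.5 * x + 0.3 * x**2  # noise-free quadratic for illustration

# degree-2 fit; coefficients are returned highest degree first
coeffs = np.polyfit(x, y, deg=2)
print(coeffs)  # approximately [0.3, 0.5, 2.0]
```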

8.13.5 Transformation

You could for example use a log transformation of the response variable

    lm(log(y) ~ x)

8.13.6 Are there other explanatory variables that you should include?

This is demonstrated with the penguin example below, where species is included as a categorical variable.

8.14 Definition - Linear Model with multiple predictors

\[\begin{equation}\label{eqn:linearregression} y_i = \beta_0 + \beta_1x_{1,i} + ... + \beta_px_{p,i} + \epsilon_i, \qquad i=1,...,n \end{equation}\] where \(y\) is the dependent variable, \(x_1 ... x_p\) are the independent variables (also called explanatory variables), \(\beta_0 ... \beta_p\) are the regression coefficients, \(\epsilon\sim N(0,\sigma^2)\) is the error term and \(p \geq 1\).

8.15 The penguin data set

Artwork by @allison_horst

Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.

https://allisonhorst.github.io/palmerpenguins/

library(palmerpenguins)

Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets':

    penguins, penguins_raw
head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

https://github.com/mcnakhaee/palmerpenguins?tab=readme-ov-file

from palmerpenguins import load_penguins
penguins = load_penguins()
penguins.head()
  species     island  bill_length_mm  ...  body_mass_g     sex  year
0  Adelie  Torgersen            39.1  ...       3750.0    male  2007
1  Adelie  Torgersen            39.5  ...       3800.0  female  2007
2  Adelie  Torgersen            40.3  ...       3250.0  female  2007
3  Adelie  Torgersen             NaN  ...          NaN     NaN  2007
4  Adelie  Torgersen            36.7  ...       3450.0  female  2007

[5 rows x 8 columns]

8.16 Exercise

  1. Using the penguin data set, fit a linear model to determine the relationship between body mass and flipper length.

  2. The flipper length differs across penguin species. How could you add this to your model?

8.16.1 Solution

Remove missing values:

library(dplyr)

Attaching package: 'dplyr'
The following object is masked from 'package:car':

    recode
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
penguins_filtered <- penguins %>%
  filter(complete.cases(.))
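The same filtering step in Python uses `DataFrame.dropna`, the pandas analogue of `filter(complete.cases(.))`. A minimal sketch on a toy frame (the values are illustrative, mimicking the penguins data loaded earlier):

```python
import numpy as np
import pandas as pd

# toy frame with one incomplete row, mimicking the penguins data
penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie"],
    "flipper_length_mm": [181.0, np.nan, 193.0],
    "body_mass_g": [3750.0, np.nan, 3450.0],
})

# keep only complete rows
penguins_filtered = penguins.dropna()
print(len(penguins_filtered))  # 2 rows remain
```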

Fit a linear model:

y <- penguins_filtered$body_mass_g
x <- penguins_filtered$flipper_length_mm

mod <- lm(y~x)
summary(mod)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-1057.33  -259.79   -12.24   242.97  1293.89 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5872.09     310.29  -18.93   <2e-16 ***
x              50.15       1.54   32.56   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 393.3 on 331 degrees of freedom
Multiple R-squared:  0.7621,    Adjusted R-squared:  0.7614 
F-statistic:  1060 on 1 and 331 DF,  p-value: < 2.2e-16
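A simple straight-line fit like this can also be reproduced in Python without any modelling package; a minimal sketch with NumPy's `polyfit` on synthetic data (the data are generated here, not the penguins, but the true slope of 50 roughly mirrors the R estimate above):

```python
import numpy as np

rng = np.random.default_rng(42)
# synthetic flipper lengths (mm) and body masses (g) with a known slope of ~50
x = rng.uniform(170, 230, 300)
y = 50.0 * x - 5800.0 + rng.normal(0.0, 400.0, 300)

# a degree-1 polynomial fit is an ordinary least-squares line;
# polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```

The estimated slope should land close to the true value of 50, just as `lm()` recovers about 50.15 g per mm above.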

Check linear relationship:

plot(x,y)

Check independence of residuals

res <- residuals(mod)
plot(x, res)

library(car)
durbinWatsonTest(mod)
 lag Autocorrelation D-W Statistic p-value
   1     -0.06161287      2.115873   0.358
 Alternative hypothesis: rho != 0
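The Durbin-Watson statistic itself is easy to compute by hand: it is the sum of squared successive differences of the residuals divided by their sum of squares. Values near 2 indicate no first-order autocorrelation, values near 0 positive and values near 4 negative autocorrelation. A sketch:

```python
import numpy as np

def durbin_watson(res):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    res = np.asarray(res, dtype=float)
    return np.sum(np.diff(res) ** 2) / np.sum(res ** 2)

# alternating residuals -> strong negative autocorrelation -> DW near 4
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0
```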

Check constant variance of errors

plot(fitted(mod),res)

Check normality of residuals

qqPlot(res)

[1]  35 164
hist(res)

shapiro.test(res)

    Shapiro-Wilk normality test

data:  res
W = 0.99237, p-value = 0.08633

Fit new models that also consider penguin species: an additive model, and one with an interaction between flipper length and species.

mod_add <-  lm(penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm + penguins_filtered$species)

mod_spec <- lm(penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm * penguins_filtered$species)

summary(mod_spec)

Call:
lm(formula = penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm * 
    penguins_filtered$species)

Residuals:
    Min      1Q  Median      3Q     Max 
-900.90 -247.01  -25.53  197.19 1143.33 

Coefficients:
                                                                        Estimate
(Intercept)                                                            -2508.088
penguins_filtered$flipper_length_mm                                       32.689
penguins_filtered$speciesChinstrap                                      -529.108
penguins_filtered$speciesGentoo                                        -4166.117
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap     1.884
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo       21.477
                                                                       Std. Error
(Intercept)                                                               892.413
penguins_filtered$flipper_length_mm                                         4.692
penguins_filtered$speciesChinstrap                                       1525.110
penguins_filtered$speciesGentoo                                          1431.581
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap      7.864
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo         6.967
                                                                       t value
(Intercept)                                                             -2.810
penguins_filtered$flipper_length_mm                                      6.967
penguins_filtered$speciesChinstrap                                      -0.347
penguins_filtered$speciesGentoo                                         -2.910
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap   0.240
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo      3.083
                                                                       Pr(>|t|)
(Intercept)                                                             0.00525
penguins_filtered$flipper_length_mm                                    1.78e-11
penguins_filtered$speciesChinstrap                                      0.72887
penguins_filtered$speciesGentoo                                         0.00386
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap  0.81077
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo     0.00223
                                                                          
(Intercept)                                                            ** 
penguins_filtered$flipper_length_mm                                    ***
penguins_filtered$speciesChinstrap                                        
penguins_filtered$speciesGentoo                                        ** 
penguins_filtered$flipper_length_mm:penguins_filtered$speciesChinstrap    
penguins_filtered$flipper_length_mm:penguins_filtered$speciesGentoo    ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 368.4 on 327 degrees of freedom
Multiple R-squared:  0.7938,    Adjusted R-squared:  0.7906 
F-statistic: 251.7 on 5 and 327 DF,  p-value: < 2.2e-16
anova(mod_add, mod_spec)
Analysis of Variance Table

Model 1: penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm + 
    penguins_filtered$species
Model 2: penguins_filtered$body_mass_g ~ penguins_filtered$flipper_length_mm * 
    penguins_filtered$species
  Res.Df      RSS Df Sum of Sq     F   Pr(>F)   
1    329 45843144                               
2    327 44391669  2   1451475 5.346 0.005193 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction term between flipper length and species significantly improves the model fit (p ≈ 0.005): the relationship between body mass and flipper length differs between species.
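The F statistic reported by `anova()` can be recomputed by hand from the two residual sums of squares and residual degrees of freedom printed in the table above:

```python
# values taken from the anova table above
rss_add, df_add = 45843144.0, 329  # additive model
rss_int, df_int = 44391669.0, 327  # interaction model

# F = ((RSS1 - RSS2) / (df1 - df2)) / (RSS2 / df2)
f_stat = ((rss_add - rss_int) / (df_add - df_int)) / (rss_int / df_int)
print(round(f_stat, 3))  # ~5.346, matching the anova output
```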

9 Plotting

9.1 Standard library

# subset data
df.4 <- df[df$hive==4,]
# plot temperature outside
plot(df.4$time, df.4$t_o, ylim=c(0,40),type = 'p', pch=4)
# choose colours
cl <- rainbow(5)
# choose columns
cols <- 4:8
# plot each column
for (i in 1:5){
    lines(df.4$time, df.4[,cols[i]],col = cl[i],type = 'p', pch=4, ylim=c(0,40))
}
# add legend
legend("topright", legend=c(1, 2, 3, 4, 5, "outside"),
       col=c(cl, "black"), pch = 4, lty = 0, cex=0.8)

9.1.1 Points or Lines? (type)

  • “p”: Points
  • “l”: Lines
  • “b”: Both

9.1.2 Line type (lty)

http://www.sthda.com/english/wiki/line-types-in-r-lty

9.1.3 Point types (pch)

http://www.sthda.com/english/wiki/r-plot-pch-symbols-the-different-point-shapes-available-in-r

9.2 ggplot

# subset data
df.4 <- df[df$hive==4,]
# choose columns
df.4.cols <- df.4[,c(1,4:9)]
# reshape data
library(reshape)

Attaching package: 'reshape'
The following object is masked from 'package:dplyr':

    rename
The following object is masked from 'package:lubridate':

    stamp
mdf <- melt(df.4.cols, id=c("time")) 
# plot data
library(ggplot2)
ggplot(data = mdf, aes(x=time, y=value)) + geom_line(aes(colour=variable)) + ylim(c(0, 40))
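The reshaping step has a direct pandas analogue: `pd.melt` converts a wide frame (one column per sensor) to long format with one row per (time, variable) pair, which is the shape plotting libraries like seaborn expect. A sketch on toy data (the column names are illustrative, not the course data set):

```python
import pandas as pd

# wide format: one column per sensor, mimicking df.4.cols
wide = pd.DataFrame({
    "time": [1, 2, 3],
    "t_i_1": [34.1, 34.3, 34.0],
    "t_i_2": [33.8, 34.0, 33.9],
})

# long format: the pandas analogue of reshape::melt(df.4.cols, id = "time")
long = pd.melt(wide, id_vars=["time"])
print(long.shape)  # (6, 3): 3 times x 2 sensors; columns time/variable/value
```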

9.3 Exercises - Plotting and more

## Plotting

# 1. Choose a set of data from your background. 
# Decide for one of the presented plotting-libraries.
# Create a plot and add a title and custom x and y labels.
# In case you plotted multiple lines, add a legend.
# Difficult: Try to find out, how you can change the fontsize of the axis-labels.

# 2. Automatically save your plot to a given destination  

# 3. Try to make your code more abstract and useful. 
# Put your code within a function.
# - use the file-name (destination) as a parameter
# - In case you have multiple covariate, 
# use the covariate(s) to be plotted as input parameter
# - In case it makes sense to subset your data, introduce
#    a parameter for the choice of the subset.

# Always have in mind: if something goes wrong, you can use debugging.

## Exercises - Piping

# 1. Rewrite the following code using %>% and %<>%:

x <- 2
round(log(x))


# 2. Rewrite the second line of following code:

x <- rnorm(10,100)
round(sum(sqrt(x)), 3)


## Loops and functions again
# If you have some code you wrote in the past 
# for some exercise, for your studies, for your work...
# See if you can find pieces of code which you 
# just copied and pasted multiple times (don't worry, this is normal)
# Try to transform the code using loops and functions
## Solutions - Piping

# requires the magrittr package for %>% and %<>%
library(magrittr)

# 1.
x <- 2
x %<>% log %>% round

# 2.
x <- rnorm(10, 100)
x %>% sqrt %>% sum %>% round(3)
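For comparison, the pipe pattern can be imitated in plain Python with function composition via `functools.reduce`; the `pipe` helper below is a hypothetical illustration, not a standard library function:

```python
import math
from functools import reduce

def pipe(value, *funcs):
    """Pass value through funcs left to right, like x %>% f %>% g in R."""
    return reduce(lambda v, f: f(v), funcs, value)

print(pipe(2, math.log, round))  # log(2) ~ 0.693 -> rounds to 1
print(pipe([4.0, 9.0], lambda xs: sum(math.sqrt(v) for v in xs)))  # 2 + 3 = 5.0
```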


10 Markdown and R-Markdown

Markdown is a great way to document your code and to write reports about your results. Here we describe R-Markdown in detail, but you can also use Python: Jupyter notebooks are especially easy to work with, since individual cells can be switched between code and Markdown. See the Jupyter documentation for details.

10.1 Output formats

The three most common output formats for R-Markdown are HTML, PDF and Word documents. If you want to create PDF documents, a TeX distribution (for example TinyTeX) needs to be installed.

Depending on your choice, your document will have a different output parameter in the header:

# html
---
title: "R Markdown Example"
author: "Somebody"
date: "August 14, 2019"
output: html_document
---
# word
---
title: "R Markdown Example"
author: "Somebody"
date: "August 14, 2019"
output: word_document
---

There are many formatting options which you can specify in the header. These options might be useful for HTML:

---
title: "R Markdown Example"
author: "Somebody"
date: "August 14, 2019"
output: 
  html_document:
    toc: true # print a table of content
    toc_depth: 2 # maximal depth of the table of content
    number_sections: true # automatically number the sections
    code_folding: hide # have buttons to show/ hide code
    theme: united # specify a theme
    toc_float: true # generate a navigation bar

---

10.2 Document structure

If you create a new R-Markdown document, R will automatically create a sample document. You can see that “#” can be used to create headers:

# Header 1

## Header 2

### Header 3

10.3 Code blocks

For code blocks you have many options, too. The two most important are:

  • eval: If set to TRUE, the code in the chunk is executed
  • echo: If set to TRUE, the source code of the chunk is shown in the rendered document

10.4 Lists

* unordered list
    + sub-item 1
    + sub-item 2
        - sub-sub-item 1
* item 2

    
1. ordered list
2. item 2
    i) sub-item 1

10.5 Insert an image

![Some description](path to image)

10.6 Further documentation

A good documentation of R-Markdown can be found here: https://bookdown.org/yihui/rmarkdown/

And here is a cheat sheet: https://www.rstudio.com/resources/cheatsheets/#rmarkdown

There is a lot more you can do with R-Markdown, see: https://rmarkdown.rstudio.com/gallery.html

11 Assignment

  • you all get a confirmation of participation
  • you can submit a small assignment for 1 CP

11.1 Assignment

  • you can choose if you want to use R or Python
  • Create a markdown document
  • Choose a dataset from PANGAEA
  • Think about the following question: What do you want to find out? What is the motivation for your analysis?
  • Choose 5 of the following tasks. You can also come up with your own ideas.
  • Write a short text for each task explaining what you have done

11.2 Ideas

  • Make some exploratory analysis:
    • print the mean/ median/ standard deviation for each column
    • print the total number of missing values
    • print the number of missing values for each column
  • Transform some of the columns/ Create new columns based on existing ones
  • Create a subset of your data
  • Exclude all rows with missing values
  • Create a plot
  • Write a function that takes a filename and some additional parameters. The function should create a plot and save it as “png” with that filename
  • Create a linear model
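In pandas, the exploratory steps from the list above take only a few lines; a sketch on a hypothetical toy DataFrame (the column names are illustrative, not from a particular PANGAEA dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "depth_m": [10.0, 20.0, np.nan, 40.0],
    "temp_c": [4.2, 3.9, 3.5, np.nan],
})

print(df.mean())                    # mean per column (NaN ignored)
print(df.median())                  # median per column
print(df.std())                     # standard deviation per column
print(int(df.isna().sum().sum()))   # total number of missing values: 2
print(df.isna().sum())              # missing values per column
complete = df.dropna()              # exclude all rows with missing values
print(len(complete))                # 2 complete rows remain
```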

11.3 Send us an e-mail

  • If you want to do an assignment, please send us an e-mail:

  • “I want to do an assignment.”

  • “I do not want a grade on my certificate. / I want to have a grade on my certificate./ I want to have a grade if it is better than <1.3, 1.7, 2.0, 2.3, 2.7, 3.0, …>”

12 Short statistical breaks

12.1 Dinosaurs

https://www.autodeskresearch.com/publications/samestats

12.2 Correlation

https://www.tylervigen.com/spurious-correlations

12.3 Beginners’ Book

https://devhumor.com/content/uploads/images/July2020/Best-Programming-Book-for-Beginners-Comic---Art-of-Searching-on-Google.png